Search CORE

1 research outputs found

Hybrid filter-wrapper approaches for feature selection

Author: Gili Fernández de Romarategui David
Publication venue: Universitat Politècnica de Catalunya
Publication date: 28/10/2021
Field of study

Durant les darreres dècades, molts sectors empresarials han adoptat les tecnologies digitals, emmagatzemant tota la informació que generen en bases de dades. A més, amb l'auge de l'aprenentatge automàtic i la ciència de les dades, s'ha tornat econòmicament rendible utilitzar aquestes dades per resoldre problemes del món real. No obstant això, a mesura que els conjunts de dades creixen en mida, cada vegada és més difícil determinar exactament quines variables són valuoses per resoldre un problema específic. Aquest projecte estudia el problema de la selecció de variables, que intenta seleccionar el subconjunt de variables rellevants per a una determinada tasca predictiva. En particular, ens centrarem en els algoritmes híbrids que combinen mètodes filtre i embolcall. Aquesta és una àrea d'estudi relativament nova, que ha obtingut bons resultats en conjunts de dades amb grans dimensions perquè ofereixen un bon compromís entre velocitat i precisió. El projecte començarà explicant diversos mètodes filtre i embolcall i seguidament ensenyarà com diversos autors els han combinat per obtenir nous algoritmes híbrids. També introduirem un nou algoritme al qual anomenarem BWRR, que utilitza el popular filtre ReliefF per guiar una cerca cap enrere. La principal novetat que proposem és recomputar ReliefF en certs punts per guiar millor la cerca. Addicionalment, introduirem diverses variacions de l'algoritme. També hem realitzat una extensa experimentació per a provar el nou algoritme. Primerament, hem treballat amb conjunts de dades sintètiques per esbrinar quins factors afectaven el rendiment. Seguidament, l'hem comparat amb l'estat de l'art en diversos conjunts de dades reals.Over the last couple of decades, more business sectors than ever have embraced digital technologies, storing all the information they generate in databases. Moreover, with the rise of machine learning and data science, it has become economically profitable to use this data to solve real-world problems. However, as datasets grow larger, it has become increasingly difficult to determine exactly which variables are valuable to solve a given problem. This project studies the problem of feature selection, which tries to select a subset of relevant variables for a specific prediction task from the complete set of attributes. In particular, we have mostly focused on hybrid filter-wrapper algorithms, a relatively new branch of study, that has seen great success in high-dimensional datasets because they offer a good trade-off between speed and accuracy. The project starts by explaining several important filter and wrapper methods and moves on to illustrate how several authors have combined them to form new hybrid algorithms. Moreover, we also introduce a new algorithm called BWRR, which uses the popular ReliefF filter to guide a backward wrapper search. The key novelty we propose is to recompute the ReliefF rankings at several points to better guide the search. In addition, we also introduce several variations of this algorithm. We have also performed extensive experimentation to test this algorithm. In the first phase, we experimented with synthetic datasets to see which factors affected the performance. After that, we compared the new algorithm against the state-of-the-art in real-world datasets

UPCommons. Portal del coneixement obert de la UPC